Internet Info 1997 December

Internet Message Format | 1997-02-19 | 7KB

Received: (from daemon@localhost) by services.bunyip.com (8.6.10/8.6.9)
    id RAA19325 for urn-ietf-out; Thu, 24 Oct 1996 17:02:28 -0400
Received: from mocha.bunyip.com (mocha.Bunyip.Com [192.197.208.1])
    by services.bunyip.com (8.6.10/8.6.9) with SMTP id RAA19316
    for <urn-ietf@services.bunyip.com>; Thu, 24 Oct 1996 17:02:23 -0400
Received: from josef.ifi.unizh.ch by mocha.bunyip.com with SMTP
    (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA00362
    (mail destined for urn-ietf@services.bunyip.com);
    Thu, 24 Oct 96 17:02:13 -0400
Received: from ifi.unizh.ch by josef.ifi.unizh.ch
    id <01664-0@josef.ifi.unizh.ch>; Thu, 24 Oct 1996 23:02:14 +0100
Subject: Re: [URN] Unicode for NSS query
To: jayhawk@ds.internic.net
Date: Thu, 24 Oct 1996 23:02:13 +0100 (MET)
Cc: urn-ietf@bunyip.com
In-Reply-To: <9610242040.AA05679@qsun2.ho.att.com.qsun2>
    from "jayhawk@ds.internic.net" at Oct 24, 96 03:46:15 pm
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 5224
From: Martin J Duerst <mduerst@ifi.unizh.ch>
Message-Id: <"josef.ifi..881:24.09.96.22.02.15"@ifi.unizh.ch>
Sender: owner-urn-ietf@services.bunyip.com
Precedence: bulk
Reply-To: Martin J Duerst <mduerst@ifi.unizh.ch>
Errors-To: owner-urn-ietf@bunyip.com

Ryan wrote:
>
>On Thu, 24 Oct 1996 21:47:41 +0100 (MET), Martin J Duerst wrote:
>
>>>
>>>On Thu, 24 Oct 1996, Martin J Duerst wrote:
>>>
>>>>
>>>> I have suggested that we might be required to think in terms of
>>>> both characters and octets. For some things, similar to a data:
>>>> URL, thinking in characters might be artificial. For some
>>>> other things, such as URLs, thinking in octets may to some
>>>> extent be necessary because of backwards compatibility issues
>>>> (assume an URL scheme is extended and decides to use some
>>>> weird RFC 1522-like method for encoding characters, and this
>>>> would have to be grandfathered).
>>>>
>>>
>>>I have thought about the same thing, and I admit that I am not altogether
>>>enthusiastic about RFC 1522 style encoding of URNs, but it may be forced
>>>upon us.
>>
>>Well, I didn't mean that this is currently the case. The ftp
>>i18n extensions, for example, are happily going in the direction
>>of UTF-8 and would very well fit together with our proposal.
>>
>>But I have made the experience that trying to specify Unicode only
>>can meet quite some resistance. Some people, for whatever reasons,
>>are strongly anti-unicode. If you specify something like
>>"urns use Unicode and nothing else", they may start to complain.
>>If you say "urns should use Unicode, and clients should interpret
>>urns as Unicode if possible, but if really necessary, an NSS
>>can use something else" then it's difficult to oppose, even
>>if maybe in actual practice, there will not be a single NSS
>>that ever specifies something else than Unicode.
>
>Huh? My "knee-jerk" reaction is that saying that an NSS that isn't in
>Unicode (10646)/UTF-8 shall be converted by the client before
>being sent to the URN resolver handles this nicely. I include in this statement
>the act of entering an NSS at an interface in some other encoding (ASCII, etc.)

Yes, but sometimes you would just not know what charset the stuff is in.
Note that there are two things:

- The user types in an urn into an application that locally uses
  some other encoding, which then gets converted to UTF-8 at the
  moment the application knows this is an URN.
- There is an NSS that for some reasons, and maybe for some of its
  namespace only, doesn't have a clue about what characters it's
  dealing with, or explicitly wants to have the relation between
  characters and octet values different than UTF-8 for some reason
  whatsoever.

It is the latter that I am speaking about, not the former.
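
To illustrate the first of the two cases (conversion at the moment the application knows it is dealing with a URN), here is a minimal sketch in Python. It assumes the application knows which charset it uses locally. The function name `urn_to_utf8`, the `example` namespace, and the ISO-8859-1 input are assumptions made for illustration only; none of them come from the thread.

```python
def urn_to_utf8(raw: bytes, local_charset: str) -> bytes:
    """Transcode a URN typed in a known local charset into UTF-8 octets."""
    text = raw.decode(local_charset)          # local octets -> characters
    if not text.lower().startswith("urn:"):   # only convert once we know it is a URN
        raise ValueError("not a URN")
    return text.encode("utf-8")               # characters -> UTF-8 octets


# Hypothetical input: "urn:example:caf<e-acute>" as typed in an ISO-8859-1 application.
latin1_input = "urn:example:caf\u00e9".encode("iso-8859-1")
utf8_octets = urn_to_utf8(latin1_input, "iso-8859-1")
print(latin1_input)  # the single octet 0xE9 stands for e-acute locally
print(utf8_octets)   # the same character becomes the two octets 0xC3 0xA9
```

This only works because the local charset is known. The second case is exactly the one where that knowledge is missing, so no such conversion can be written.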

>>NSSs or URL schemes (I still have problems distinguishing them)
>>themselves might also do the same thing, namely saying that
>>in general, character semantics should be ISO 10646/UTF-8, but
>>that other things would be tolerated for backwards compatibility.
>>This is exactly the case for ftp, where a lot of 8-bit filenames
>>are already existing and in use, although not UTF-8.
>
>I'm not sure I'm comfortable with this. My view is that the documentation of the NID
>would specify this as an exception to the syntax document and the syntax document
>says "here's the syntax, but exceptions are allowed for namespaces that document
>their exceptions elsewhere"? Seems awfully kludgy to me...

Yes, in some sense it is. But this is reality, which is not
always simple and logical. Take ftp as an example. Although not
officially standardised, a lot of ftp hosts have files with
8-bit octets in their names. Although you can make some guesses
about what characters these octets stand for, you are really
not sure. Even on the same server, or in the same URL, different
files, or different parts, may have different encodings, because
the encodings reflect the settings of the users that created
these filenames.
And it is impossible to force upon all existing ftp servers
that they suddenly know what characters their filenames
represent if up to now URLs have been octets and nothing
else. The only thing we can hope for, and which we are
working for in the ftp-wg, is a gradual transition towards
UTF-8.

>>It is also important because not every arbitrary sequence of
>>8-bit octets is an UTF-8 sequence. What should browsers, resolvers,
>>and all the other components of our URN architecture do with
>>these?
>
>I expect that anything that tries to resolve a sequence of 8-bit octets that isn't a
>UTF-8 sequence should fail, unless the namespace has provided information otherwise.

That might be a solution, unless there are resolvers that do
resolution without knowing much about specific namespaces.
But for existing URL schemes, the assumption would have to be
that they implicitly specify otherwise.

>I see this as related to the above issue: The two basic choices are:
>
>(1) Allow things other than UTF-8 (and force the exceptions to be documented as
>part of namespace registration). The resolvers then have to use the namespace
>specific information (including exceptions).
>
>(2) Do not allow other than UTF-8 and
>reject the arbitrary 8-bit octet sequence as unresolvable if it is unresolvable.
>
>The first is cleaner from the point of view of the end-users, the second from the point
>of view of internal cleanliness.

The second, at any rate, is not very practical.

Regards,    Martin.
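
The two choices quoted above can be sketched in Python as follows. This is only an illustration under stated assumptions, not something specified in the thread: the registry `NAMESPACE_CHARSETS`, the namespace name `legacy-ftp`, and the function `decode_nss` are all hypothetical. The sketch relies on the fact that Python's strict UTF-8 decoder raises `UnicodeDecodeError` for octet sequences that are not valid UTF-8.

```python
from typing import Dict, Optional

# Hypothetical registry: namespace identifier -> charset that the namespace
# documented as an exception when it was registered (choice 1).
NAMESPACE_CHARSETS: Dict[str, str] = {"legacy-ftp": "iso-8859-1"}


def decode_nss(nid: str, nss: bytes, allow_exceptions: bool) -> Optional[str]:
    """Return the NSS as characters, or None if it is treated as unresolvable."""
    try:
        return nss.decode("utf-8")  # strict decoding rejects non-UTF-8 octets
    except UnicodeDecodeError:
        pass
    # Choice 1: fall back to namespace-specific information, if any exists.
    if allow_exceptions and nid in NAMESPACE_CHARSETS:
        return nss.decode(NAMESPACE_CHARSETS[nid])
    # Choice 2 (or no documented exception): reject the octet sequence.
    return None


octets = b"caf\xe9"  # valid ISO-8859-1, but not a valid UTF-8 sequence
print(decode_nss("legacy-ftp", octets, allow_exceptions=True))   # decoded via the exception
print(decode_nss("legacy-ftp", octets, allow_exceptions=False))  # None: rejected
```

A resolver that knows nothing about specific namespaces has no such registry to consult, which is the difficulty raised in the reply above.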